Privacy-preserving parallel kNN classification algorithm using index-based filtering in cloud computing

With the development of cloud computing, interest in database outsourcing has recently increased. In cloud computing, it is necessary to protect the sensitive information of data owners and authorized users. For this purpose, data mining techniques over encrypted data have been studied to protect the original database, user queries, and data access patterns. A typical data mining technique is kNN classification, which is widely used for data analysis and artificial intelligence. However, existing works do not provide a sufficient level of efficiency for a large amount of encrypted data. To solve this problem, we propose a privacy-preserving parallel kNN classification algorithm. To reduce the computation cost of encryption, we propose an improved secure protocol that uses an encrypted random value pool. To reduce the query processing time, we not only design a parallel algorithm but also adopt a garbled circuit. In addition, a security analysis of the proposed algorithm is performed to prove its data protection, query protection, and access pattern protection. In our performance evaluation, the proposed algorithm performs about 2 to 25 times better than existing algorithms.


Introduction
With the growing popularity of cloud computing, there has been growing interest in outsourcing databases. Cloud computing provides a service that allows internet-connected users to use virtual computing resources such as storage, computation, and network. Thus, a cloud service provider can maintain computing resources rapidly and flexibly. A data owner can reduce the effort needed to purchase, install, and expand computing systems, and can mitigate the constraints of physical space. Cloud computing is attracting a lot of attention from individuals and companies because it reduces the cost of system maintenance and data management, and makes the needed computing resources available without expertise. Meanwhile, three requirements should be considered in an outsourced database. First, it is necessary to protect the database because it contains sensitive information of the data owner [1,2]. Second, the query and the query result should not be exposed because personal information related to user preference
The motivation of this paper is as follows. First, the existing algorithms suffer from high computational cost because they use an encrypted binary array to perform comparison operations. Therefore, we aim to reduce the computational cost by proposing a secure comparison protocol based on Yao's garbled circuit. Second, the existing algorithms require a high data encryption cost. To deal with this problem, we propose an improved secure protocol that uses an encrypted random value pool. Finally, to the best of our knowledge, there is no existing parallel kNN classification algorithm. We aim to design a parallel kNN classification algorithm for processing a large amount of encrypted data.
The contributions of this paper are as follows.
• Supporting privacy preservation: By processing queries using homomorphic encryption without data decryption, we can protect the confidentiality of both data and user's queries while hiding data access patterns from an attacker.
• Reducing computation cost: By using the improved secure protocol based on an encrypted random value pool, we can reduce the high computation cost of the random value generation for the data encryption.
• Improving the performance of kNN classification: By proposing a new parallel kNN classification algorithm, we can reduce the amount of processing time for kNN classification.
The rest of this paper is organized as follows. In Section 2, we introduce the existing works on kNN classification algorithms over encrypted databases. In Section 3, we describe the overall system architecture and propose secure protocols for the proposed parallel kNN classification algorithm. In Section 4, we propose a parallel kNN classification algorithm that preserves both data and query privacy on the cloud. In Section 5, we provide the security proof of our kNN classification algorithm. In Section 6, we present a performance analysis of the proposed algorithm. In Section 7, we discuss the impact of the proposed parallel classification algorithm. Finally, in Section 8, we conclude the paper and outline future work.
2 Background and related work
2.1) Background
2.1.1 Paillier cryptosystem. The Paillier cryptosystem is a probabilistic asymmetric algorithm for public-key cryptography [18]. In the Paillier cryptosystem, the encryption key pk is given as (N, g), where N is the product of two large prime numbers p and q, and g is a random integer in Z_{N^2}. Here, Z_{N^2} denotes the integer domain ranging from 0 to N^2 − 1. Meanwhile, the decryption key sk is given as (p, q). The Paillier cryptosystem has the following characteristics. First, the Paillier cryptosystem supports homomorphic addition and homomorphic multiplication by a plaintext. Assume that the encryption function of the Paillier cryptosystem is E(.) and its decryption function is D(.). For two encrypted data E(a) and E(b), the product E(a) × E(b) is equal to E(a + b), which is the encrypted value of the plaintext a + b, as shown in Eq (1).

$$E(a + b) = E(a) \times E(b) \bmod N^2 \qquad (1)$$

For two plaintexts a and b, the b-th power of the encrypted data E(a), i.e., E(a)^b, is equal to E(a × b), which is the encrypted value of the plaintext a × b, as shown in Eq (2).

$$E(a \times b) = E(a)^b \bmod N^2 \qquad (2)$$
Second, the Paillier cryptosystem supports semantic security, where only negligible information about the plaintext can be feasibly extracted from the ciphertext. Specifically, any probabilistic polynomial-time algorithm (PPTA) that is given the ciphertext of a certain message m and its length cannot determine any partial information on the message with a probability higher than all other PPTAs that only have access to the message length [19]. This concept is the computational-complexity analogue of Shannon's concept of perfect secrecy. Perfect secrecy means that the ciphertext reveals no information at all about the plaintext, whereas semantic security implies that any information revealed cannot be feasibly extracted.
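The two homomorphic properties in Eqs (1) and (2) can be checked numerically. The following is a minimal sketch, not the implementation used in this paper: the tiny primes, the choice g = N + 1, and the helper names `keygen`, `encrypt`, and `decrypt` are assumptions made for illustration, and the key size is far too small to be secure.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy key generation; these primes are far too small for real use."""
    n = p * q
    phi = (p - 1) * (q - 1)
    g = n + 1                     # a simple valid generator choice (assumption)
    mu = pow(phi, -1, n)          # modular inverse of phi mod n (Python 3.8+)
    return (n, g), (phi, mu)

def encrypt(pk, m, rng=random):
    n, g = pk
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:    # r must be invertible modulo n
        r = rng.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(pk, sk, c):
    n, _ = pk
    phi, mu = sk
    x = pow(c, phi, n * n)
    return (x - 1) // n * mu % n

pk, sk = keygen()
n2 = pk[0] ** 2
a, b = 1234, 56
sum_ct = encrypt(pk, a) * encrypt(pk, b) % n2   # Eq (1): ciphertext product
prod_ct = pow(encrypt(pk, a), b, n2)            # Eq (2): ciphertext power
print(decrypt(pk, sk, sum_ct), decrypt(pk, sk, prod_ct))
```

Both results are only meaningful while a + b and a × b stay below N, since plaintext arithmetic is performed modulo N.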
2.1.2 Attack model. In the outsourced database environment, two attack models can be considered: a semi-honest attack model and a malicious attack model [20]. In the semi-honest (or honest-but-curious) attack model, the cloud performs its own protocol honestly, but attempts to obtain sensitive data about the data owner and the authorized user during the protocol execution. To prevent a semi-honest attack, sensitive data must always be protected. A malicious attack model attempts to acquire sensitive data by deviating from a given secure protocol. Because a secure protocol can be contaminated by a malicious attack, it is difficult to recover the secure protocol. To protect sensitive data against the malicious attack model, a defender focuses on detecting attacks and recovering the damaged secure protocol. Since we aim to protect sensitive data in cloud computing, we design our algorithm based on the semi-honest attack model. A secure protocol for the semi-honest attack model is defined as follows [17].
Definition 1. Let a_i be the input data of cloud C_i, Π_i(π) the execution image of C_i for the protocol π, and b_i the result data of C_i after executing π. If the simulated execution image Π_i^S(π) is computationally indistinguishable from Π_i(π), the protocol π is said to be a secure protocol for the semi-honest attack model.
In Definition 1, the execution image generally includes the input data and output data of the protocol. The security of the protocol under the semi-honest attack model can be verified by showing that the protocol's execution image does not expose the cloud's data.
[21] based on a partition-based secure Voronoi diagram (SVD) [22]. The SVD relies on any standard encryption scheme E, such as the public-key encryption RSA or the symmetric-key encryption AES, rather than on a new encryption scheme. Because the SVD is as secure as E in any standard security model in which E is proven secure, the SVD is indistinguishable under either chosen-plaintext or chosen-ciphertext attacks. To process secure kNN classification queries, the algorithm retrieves the relevant encrypted partition instead of finding the exact encrypted k-nearest neighbors. However, most of the computations are performed locally by the end user while processing the kNN classification query. As a result, the algorithm conflicts with the purpose of outsourcing the DBMS functionalities to the cloud. Furthermore, the algorithm leaks data access patterns to the cloud, such as the partition ID corresponding to a user query.
[16]. PPkNN can protect the confidentiality of the data, the user's input query, and the data access patterns. PPkNN mainly consists of two stages: the secure retrieval of k-nearest neighbors and the secure computation of the majority class. In the secure retrieval of k-nearest neighbors, a query user initially sends the query q (in encrypted form) to C_1. Then, C_1 and C_2 engage in a set of sub-protocols to securely retrieve the class labels corresponding to the k-nearest neighbors of the input query q. At the end of this step, the encrypted class labels of the k-nearest neighbors are known only to C_1. In the secure computation of the majority class, C_1 and C_2 jointly compute the class label with majority voting among the k-nearest neighbors of q. At the end of this step, only the query user knows the class label corresponding to the input query record q. However, PPkNN requires a very high computation cost for hiding data access patterns.

H. Kim et al.'s work.
H. Kim et al. proposed a secure kNN classification algorithm which uses both the Paillier cryptosystem and an encrypted kd-tree index [17]. The Paillier cryptosystem is a homomorphic encryption scheme which is indistinguishable under either chosen-plaintext or chosen-ciphertext attacks, so the cloud can process kNN classification queries without decrypting any data or a user's query. Before outsourcing data to the cloud, a data owner builds a kd-tree index and encrypts both the original database and the leaf nodes of the kd-tree index. Therefore, the algorithm can protect the data, the query, and the data access pattern. By using the encrypted kd-tree index, the algorithm can reduce the query processing time. However, because the algorithm must generate encrypted random values to preserve privacy, it requires a high computation cost.
[23]. The algorithm newly generates unique classification label keys for each user through a secure three-party protocol. The keys are used to re-encrypt the labels into new ciphertexts that can only be decrypted by the corresponding user. The algorithm hides the data access patterns from a federated cloud server which performs kNN classification by using two non-colluding clouds. However, the algorithm conflicts with the purpose of outsourcing the DBMS functionalities to the cloud because both the data owner and authorized users must participate in the process of label re-encryption.

Y. Tan et al.'s work.
Y. Tan et al. proposed a lightweight edge-based privacy-preserving kNN classification algorithm over a hybrid encrypted cloud database [24]. A data owner can upload his/her database to the cloud server, and an authorized user can send a query to the cloud server to execute kNN queries. The algorithm is performed against the semi-honest attack model. After the query is sent, the authorized user does not need to participate in the kNN classification. They also proposed a secure distance protocol in which the cloud servers cannot derive any private information from the authorized user. Compared with the SIP protocol in the state-of-the-art PPKC algorithm [16], the proposed secure distance protocol has less corrupted computation.
2.2.6 J. Du and F. Bian's work. J. Du and F. Bian proposed a non-interactive and efficient privacy-preserving kNN classification algorithm [25]. The algorithm is performed against the semi-honest attack model. To achieve privacy preservation, the algorithm encrypts all outsourced data and users' query records by using two encryption schemes: order preserving encryption [26] and the Paillier cryptosystem [16]. To hide the data access pattern, the information in the cloud server is always maintained in ciphertext format. In terms of classification accuracy, the algorithm is proven to be very close to one using both plaintext data and the non-interactive encrypted data query scheme.

3.1) System architecture
In the outsourced database environment, two attack models can be considered: a malicious attack model and a semi-honest attack model [20]. In the malicious attack model, the cloud can deviate from the protocol procedure. A protocol that is secure against the malicious attack model is inefficient because it incurs an exceedingly high cost. In the semi-honest attack model, the cloud correctly follows the given protocol, but tries to acquire the sensitive information of both the data owner and the query issuer. A protocol that is secure against the semi-honest attack model is practical because the cloud has a higher level of authority than outsider attackers. Therefore, following earlier work [16,17], we also adopt the semi-honest attack model. Table 2 shows a list of notations used in this paper.

PLOS ONE

Our system architecture supports secure protocols between clouds by performing Secure Multiparty Computation (SMC). SMC is based on multi-party data processing in which several entities cooperate to perform calculations for deriving specific results. The following factors must be satisfied to obtain the result of secure protocols while avoiding data leakage.

3.1.1 Input privacy. No information about private data held by multiple parties can be inferred from the messages sent during the protocol execution. The only information that can be inferred about private data is whatever could be inferred from seeing the output of the function alone.
3.1.2 Correctness. Any proper subset of adversarial colluding parties that is willing to share information or deviate from the instructions during the protocol execution should not be able to force honest parties to output an incorrect result. This correctness goal comes in two categories: either the honest parties are guaranteed to compute the correct output (a robust SMC protocol), or the honest parties abort if they find an error (an SMC protocol with abort).
Fig 2 shows the overall system architecture. The data owner holds the original database T consisting of n records t_i (1 ≤ i ≤ n). Each record t_i includes m attributes (or columns) and one label. We denote the j-th attribute of the i-th record as t_{i,j} (1 ≤ i ≤ n, 1 ≤ j ≤ m + 1). First, the data owner partitions the original data by using the kd-tree index. Assuming that the level of the constructed kd-tree is h, the total number of leaf nodes is 2^{h−1}. In each leaf node, an attribute stores its region information, i.e., a lower bound lb_{z,j} and an upper bound ub_{z,j}, where 1 ≤ z ≤ 2^{h−1} and 1 ≤ j ≤ m. Second, the data owner generates an encryption public key (pk) and a decryption secret key (sk) based on the Paillier cryptosystem [18]. Third, the data owner encrypts the database with the Paillier cryptosystem to protect the original data. Because the unit of encryption is the attribute of each record, E(t_{i,j}) (1 ≤ i ≤ n, 1 ≤ j ≤ m + 1) is generated. Finally, the leaf nodes of the constructed kd-tree are encrypted because the data owner needs to protect the data access pattern. Because the unit of encryption is the attribute of each leaf node, E(lb_{z,j}) and E(ub_{z,j}) are generated (

3.2.1 Encrypted random value pool. To support data privacy in a cloud computing environment, the existing works [16,17] prevent C_B from extracting meaningful information (Fig 2) while executing a secure protocol by using the Paillier cryptosystem. However, they require a high computation cost because the secure protocol generates an encrypted random value for protecting the original data. Therefore, we propose an encrypted random value pool to reduce the computation cost of encryption. Before C_A processes a query (Fig 2), we generate random plaintexts from Z_N and store their encryptions in an encrypted random value pool. While processing a query in C_A, a random ciphertext is selected from the encrypted random value pool whenever a secure protocol is called. Therefore, while processing a secure protocol, C_A not only prevents C_B from extracting meaningful information, but also reduces the cost of generating encrypted random values. Table 3 shows a comparison of the number of data encryptions for each secure protocol in our work and existing works [16,17].
3.2.2 Secure multiplication protocol using an encrypted random value pool. We propose a Secure Multiplication protocol using an Encrypted random value pool (SME protocol), which multiplies two encrypted values E(α) and E(β). Algorithm 1 shows the SME protocol. First, when two encrypted values E(α) and E(β) are given as inputs, C_A selects two random values E(r_a) and E(r_b) from the encrypted random value pool (line 1). Second, C_A calculates E(α + r_a) and E(β + r_b) by using Eq (1), then sends them to C_B (lines 2-3). Third, C_B decrypts E(α + r_a) and E(β + r_b) by using the secret key and multiplies the two plaintexts α + r_a and β + r_b (line 4). Fourth, C_B encrypts (α + r_a) × (β + r_b) and sends it to C_A (line 5). Finally, C_A obtains E(α × β) by removing α × r_b, β × r_a, and r_a × r_b from the received value, where N − x is equivalent to −x in the Z_N domain (line 6).
Algorithm 1 SME Protocol
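The steps of the SME protocol described above can be traced in a small single-process simulation. This is a hedged sketch rather than the authors' Algorithm 1: the toy Paillier helpers, the pool layout as (r, E(r)) pairs, and the names `build_pool` and `sme` are assumptions. In particular, the sketch assumes that C_A also knows the plaintext r of each pooled ciphertext, which it needs in order to strip the cross terms in line 6, and it does not model how pool entries are refreshed.

```python
import math
import random

def keygen(p=1009, q=1013):                  # toy primes, illustration only
    n, phi = p * q, (p - 1) * (q - 1)
    return (n, n + 1), (phi, pow(phi, -1, n))

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)

def decrypt(pk, sk, c):
    n, (phi, mu) = pk[0], sk
    return (pow(c, phi, n * n) - 1) // n * mu % n

def build_pool(pk, size):
    """Offline step: pre-encrypt random values so C_A encrypts nothing per query."""
    pool = []
    for _ in range(size):
        r = random.randrange(pk[0])
        pool.append((r, encrypt(pk, r)))
    return pool

def sme(pk, sk, pool, e_alpha, e_beta):
    n = pk[0]
    n2 = n * n
    # C_A, line 1: pick two pooled random values
    (r_a, e_ra), (r_b, e_rb) = random.sample(pool, 2)
    # C_A, lines 2-3: mask both inputs with Eq (1) and send them to C_B
    e_x = e_alpha * e_ra % n2                # E(alpha + r_a)
    e_y = e_beta * e_rb % n2                 # E(beta + r_b)
    # C_B, lines 4-5: decrypt, multiply the masked plaintexts, re-encrypt
    e_h = encrypt(pk, decrypt(pk, sk, e_x) * decrypt(pk, sk, e_y) % n)
    # C_A, line 6: remove alpha*r_b, beta*r_a and r_a*r_b (N - x acts as -x)
    e_h = e_h * pow(e_alpha, n - r_b, n2) % n2
    e_h = e_h * pow(e_beta, n - r_a, n2) % n2
    e_h = e_h * pow(e_ra, n - r_b, n2) % n2
    return e_h                               # E(alpha * beta)

pk, sk = keygen()
pool = build_pool(pk, 8)
print(decrypt(pk, sk, sme(pk, sk, pool, encrypt(pk, 6), encrypt(pk, 7))))
```

In this sketch the only encryption performed during the protocol call is the one by C_B in line 5; the masks used by C_A come from the precomputed pool.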

3.2.3 Garbled secure compare protocol using an encrypted random value pool. We propose the Garbled Secure Compare protocol using an Encrypted random value pool (GSCE protocol), which is performed by using a garbled circuit consisting of two ADD gates and one CMP gate [27]. Assume that E(u) and E(v) are the ciphertexts of two plaintexts u and v. When E(u) and E(v) are given to C_A, the GSCE protocol returns E(1) if u ≤ v is satisfied; otherwise it returns E(0). Algorithm 2 shows the GSCE protocol. First, C_A selects two random values E(r_u) and E(r_v) from the encrypted random value pool (line 1). Second, . Third, C_A randomly selects one of two random functions, i.e., F_0 and F_1. The selected random function is not disclosed to C_B. Fourth, C_B decrypts the data received from C_A (lines 8-11). When C_A selects F_0, C_B acquires an ordered pair <m_2, m_1>; otherwise C_B acquires an ordered pair <m_1, m_2>. Fifth, C_A creates a garbled circuit consisting of two ADD gates and one CMP gate. If F_0 is selected, −r_v and −r_u are transferred to the first ADD gate and the second ADD gate, respectively. Otherwise, −r_u and −r_v are transferred to the first and the second ADD gates, respectively (lines 12-16). Sixth, C_B transfers the first data to the first ADD gate, and the second data to the second ADD gate. Therefore, when F_0 is selected, C_B transfers m_2 and m_1 to the first and the second ADD gates, respectively. Otherwise, m_1 and m_2 are transferred to the first and the second ADD gates, respectively (lines 17-20). Seventh, the first ADD gate adds two input values: −r_v and m_2 for F_0, and −r_u and m_1 for F_1. The result of the first ADD gate (result_1) is transferred to the CMP gate (lines 21-24). Eighth, the second ADD gate adds two input values: −r_u and m_1 for F_0, and −r_v and m_2 for F_1. The result of the second ADD gate (result_2) is transferred to the CMP gate (lines 25-28). Due to the characteristics of the garbled circuit, no information is exposed in the ADD gates.
Ninth, the CMP gate returns α = 1 if result_1 ≤ result_2, and α = 0 otherwise (lines 29-30). Finally, the result α can be checked on the C_B side, and C_B transmits E(α) to C_A (line 31). Because C_B does not know whether F_0 or F_1 is selected by C_A, C_B cannot determine the result of the comparison of E(u) and E(v). When F_0 is selected, C_A changes E(α) through the SBN protocol [11] and returns E(α) (lines 32-34). Here, C_A cannot obtain the actual value of α due to the characteristics of the Paillier cryptosystem.

Privacy-preserving parallel kNN classification algorithm using index filtering
The proposed parallel kNN classification algorithm can support the protection of data, query, and data access pattern in a cloud computing environment. For this, the proposed privacy-preserving parallel kNN classification algorithm is composed of four phases: secure index search, k-nearest neighbors search, kNN verification, and kNN classification, as shown in

4.1) Secure index search phase
In the secure index search phase, the proposed algorithm determines the leaf node which includes the given query in the encrypted kd-tree. The procedure of the secure index search is shown in Algorithm 3. First, C_A makes partitions of t leaf nodes each and allocates them to the given threads (line 1). Here, t is calculated by dividing the number of leaf nodes by the number of threads. Second, by using the GSRO protocol, the algorithm finds which leaf node includes the query in each thread. If a node includes the query, the GSRO protocol returns E(1); otherwise the protocol returns E(0). The result of the GSRO protocol is stored in an array E(α). The algorithm randomly reorders the members of the array E(α) and transfers the reordered array E(α') to C_B (lines 2-7). Third, C_B decrypts the array E(α') and makes groups by allocating the decrypted members uniformly based on the number of 1s. If a node has the decrypted value of 1, it becomes the seed of a group. C_B sends the groups to C_A (lines 8-15). Finally, C_A extracts all the encrypted data in the node corresponding to E(1). If a node has E(1), the algorithm can safely extract the data of the node because the node includes the query. Otherwise, the algorithm can remove the data of the node because it does not include the query (lines 16-30).
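The filtering logic of the secure index search phase can be illustrated on plaintext data. The sketch below is only an analogue of Algorithm 3 under stated assumptions: `contains` stands in for the GSRO protocol, all values are unencrypted, the reordering and grouping steps between C_A and C_B are omitted, every leaf is assumed to hold the same number of records, and the dictionary layout of a leaf is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def contains(leaf, q):
    """Plaintext stand-in for GSRO: 1 if lb_j <= q_j <= ub_j in every dimension."""
    return int(all(lb <= x <= ub
                   for lb, ub, x in zip(leaf["lb"], leaf["ub"], q)))

def index_search(leaves, q, num_threads=2):
    # t leaf nodes per thread: number of leaves divided by number of threads
    t = -(-len(leaves) // num_threads)        # ceiling division
    parts = [leaves[i:i + t] for i in range(0, len(leaves), t)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        chunks = pool.map(lambda part: [contains(leaf, q) for leaf in part], parts)
        flags = [f for chunk in chunks for f in chunk]
    # Extraction by arithmetic instead of branching: each record is multiplied
    # by its leaf's flag and the products are summed over all leaves, so the
    # same operations are applied to every leaf regardless of the flags.
    fanout = len(leaves[0]["data"])
    dims = len(leaves[0]["data"][0])
    cands = [[sum(f * leaf["data"][i][j] for f, leaf in zip(flags, leaves))
              for j in range(dims)]
             for i in range(fanout)]
    return flags, cands

leaves = [
    {"lb": (0, 0), "ub": (4, 4), "data": [(1, 1), (2, 3)]},
    {"lb": (5, 0), "ub": (9, 4), "data": [(6, 1), (8, 2)]},
    {"lb": (0, 5), "ub": (4, 9), "data": [(1, 7), (3, 8)]},
    {"lb": (5, 5), "ub": (9, 9), "data": [(6, 6), (9, 9)]},
]
flags, cands = index_search(leaves, q=(7, 2))
print(flags, cands)
```

In the encrypted setting the multiplications and additions would be carried out with secure multiplication and Eq (1), so neither cloud learns which leaf was selected.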

4.2) k-Nearest neighbors search phase
In the k-nearest neighbors search phase, our algorithm finds the k nearest points among the encrypted candidates extracted in the index search phase. The procedure of the k-nearest neighbors search is shown in Algorithm 4. First, C_A calculates the squared Euclidean distance set E(d_i) (1 ≤ i ≤ cnt, where cnt is the number of candidates) between the query and the encrypted candidates through the ESSED protocol [17] in parallel (lines 1-6). Second, C_A finds the minimum value E(d_min) among E(d_i) (1 ≤ i ≤ cnt) through the SMIN_n protocol [16] (lines 8-10). Additionally, C_A calculates the difference between E(d_min) and E(d_i) (1 ≤ i ≤ cnt) by using E(d_min) × E(d_i)^{N−1}, and stores the results into an array E(τ_i). C_A then obtains E(τ'_i) by raising E(τ_i) to the power of a random integer. C_A makes E(β_i) (1 ≤ i ≤ cnt) by applying a shuffling function π to E(τ'_i) (1 ≤ i ≤ cnt) and sends it to C_B (lines 11-18). Therefore, the original distance and data access patterns are protected from where m is the data dimension). C_A stores the result of the SM protocol in a temporary array E(V'_{i,j}) (1 ≤ i ≤ cnt and 1 ≤ j ≤ m + 1). Next, C_A calculates Eq 3 by using Eq 1 (lines 23-31). Fifth, if the algorithm does not find k-nearest neighbors,
for t × (i − 1) ≤ j ≤ t × i
36: If s < k then,
37: Terminate thread
39: return E(t')
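The core loop of this phase (find the minimum distance, mark the matching candidate through a randomized difference, then exclude it) can be sketched on plaintext values. This is an illustrative analogue rather than Algorithm 4: distances are unencrypted, the shuffling function π is omitted, the constant `MAX` and the range of the random factors are assumptions, and the name `knn_search` is hypothetical.

```python
import random

MAX = 10**9   # hypothetical stand-in for the encrypted maximum value E(MAX)

def ssed(p, q):
    """Squared Euclidean distance (plaintext stand-in for ESSED)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_search(cands, q, k, rng=random):
    d = [ssed(c, q) for c in cands]
    picked = []
    for _ in range(min(k, len(cands))):
        d_min = min(d)                           # stand-in for SMIN_n
        # tau_i = d_min - d_i, scaled by a random nonzero factor so that only
        # the zero entry (the current minimum) remains recognizable
        tau = [(d_min - d_i) * rng.randrange(1, 1000) for d_i in d]
        i = tau.index(0)                         # the party decrypting tau sees only zero / nonzero
        picked.append(i)
        d[i] = MAX                               # exclude the selected candidate
    return picked

cands = [(0, 0), (5, 5), (2, 1), (9, 9)]
print(knn_search(cands, q=(1, 1), k=3))
```

If several candidates share the minimum distance, this sketch picks the first one; how ties are handled in the encrypted protocol is not modeled here.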

4.3) k-Nearest neighbors verification phase
In the k-nearest neighbors verification phase, the algorithm verifies whether the distance between a node and the query (E(q) = <E(q_1), E(q_2), ..., E(q_m)>, where m is the data dimension) is shorter than the distance E(dist_k) between the query and the k-th nearest neighbor (E(t'_k) = <E(t'_{k,1}), E(t'_{k,2}), ..., E(t'_{k,m})>). The procedure of the k-nearest neighbors verification phase is shown in Algorithm 5. First, C_A calculates E(dist_k) between E(q) and E(t'_k) using the ESSED protocol (line 1). Second, the algorithm performs the GSCE protocol between E(q_j) and the lower bound of node_z, E(node_z.lb_j) (1 ≤ z ≤ num_node), for each dimension j (1 ≤ j ≤ m), and stores the result of the GSCE protocol into E(ψ_{1,j}). If q_j (1 ≤ j ≤ m) is less than or equal to node_z.lb_j, E(ψ_{1,j}) is E(1). Then, the algorithm performs the GSCE protocol between E(q_j) (1 ≤ j ≤ m) and the upper bound of node_z, E(node_z.ub_j) (1 ≤ z ≤ num_node), for each dimension j, and stores the result of the GSCE protocol into E(ψ_{2,j}) (lines 2-5). If q_j is less than or equal to node_z.ub_j, E(ψ_{2,j}) is E(1). Third, the algorithm performs the SBXOR protocol [16] between E(ψ_{1,j}) and E(ψ_{2,j}), and stores the result of the SBXOR protocol into E(ψ_{3,j}) (line 6). Fourth, the algorithm calculates the shortest point of node_z (1 ≤ z ≤ num_node), E(sp_z) = <E(sp_{z,1}), E(sp_{z,2}), ..., E(sp_{z,m})>, where m is the data dimension, by using Eqs 5 and 6 (lines 7-10).

$$f(E(lb_j), E(ub_j)) = SM(E(\psi_{1,j}), E(lb_j)) \times SM(SBN(E(\psi_{1,j})), E(ub_j)) \qquad (5)$$

$$E(sp_{z,j}) = SM(E(\psi_{3,j}), E(q_j)) \times SM(SBN(E(\psi_{3,j})), f(E(lb_j), E(ub_j))) \qquad (6)$$

Fifth, C_A calculates the squared Euclidean distance between E(q) and E(sp_z) (1 ≤ z ≤ num_node) through the ESSED protocol and stores the result as the shortest distance of node_z, E(spdist_z) (1 ≤ z ≤ num_node) (line 11). In addition, C_A updates E(spdist_z) (1 ≤ z ≤ num_node) by using Eq 7 (lines 12-13).

$$E(spdist_z) \leftarrow SM(E(\alpha_z), E(max)) \times SM(SBN(E(\alpha_z)), E(spdist_z)) \qquad (7)$$

E(α_z) in Eq 7 is the result of the GSRO protocol in the secure index search phase (Algorithm 3). This update avoids an unnecessary index search by setting the shortest distance of a node already searched in the previous phase to the maximum value.
Sixth, C_A performs the GSCE protocol between E(spdist_z) and E(dist_k), and stores the result into E(α_z) (line 14). If E(spdist_z) is less than E(dist_k), node_z needs additional searching. Finally, by performing lines 9-33 of the secure index search phase, C_A extracts the encrypted data belonging to node_z and adds them to E(t'). In addition, C_A obtains the kNN result array, E(result_i) (1 ≤ i ≤ k), by performing the k-nearest neighbors search phase (lines 15-17). C_A stores the labels of the k-nearest neighbors into E(L'_i) (1 ≤ i ≤ k) (lines 18-19).
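The selection arithmetic behind Eqs 5 to 7 can be checked on plaintext values, where secure multiplication becomes ordinary multiplication, the ciphertext product becomes addition, and SBN becomes 1 − x. The following sketch is an unencrypted analogue only: `leq` stands in for the GSCE protocol, and the names `shortest_point`, `needs_search`, and the constant `MAX` are assumptions.

```python
MAX = 10**9   # hypothetical stand-in for E(max)

def leq(u, v):
    """Plaintext stand-in for GSCE: 1 if u <= v, else 0."""
    return int(u <= v)

def shortest_point(q, lb, ub):
    sp = []
    for q_j, lb_j, ub_j in zip(q, lb, ub):
        psi1 = leq(q_j, lb_j)                    # query at or below the lower bound
        psi2 = leq(q_j, ub_j)                    # query at or below the upper bound
        psi3 = psi1 ^ psi2                       # SBXOR: 1 when lb_j < q_j <= ub_j
        f = psi1 * lb_j + (1 - psi1) * ub_j      # Eq 5: pick the nearer bound
        sp.append(psi3 * q_j + (1 - psi3) * f)   # Eq 6: query itself or that bound
    return sp

def needs_search(q, lb, ub, dist_k, alpha_z):
    sp = shortest_point(q, lb, ub)
    spdist = sum((a - b) ** 2 for a, b in zip(q, sp))
    spdist = alpha_z * MAX + (1 - alpha_z) * spdist   # Eq 7: mask visited nodes
    return leq(spdist, dist_k)   # GSCE tests <=, so ties are also searched here

print(shortest_point((5, 5), (0, 0), (3, 10)))        # clamps the query to the box
print(needs_search((5, 5), (0, 0), (3, 10), dist_k=9, alpha_z=0))
```

The three cases q_j ≤ lb_j, lb_j < q_j ≤ ub_j, and q_j > ub_j yield lb_j, q_j, and ub_j respectively, which is the point of the node region closest to the query in that dimension.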

4.4) k-Nearest neighbors classification phase
In the kNN classification phase, the algorithm extracts the most frequent label from the labels of the k-nearest neighbors, E(L'_i) (1 ≤ i ≤ k). The procedure of the kNN classification phase is shown in Algorithm 6. C_A and C_B calculate the frequency of E(L'_i) (1 ≤ i ≤ k) by using the secure frequency protocol [17] (line 1). The label with the highest frequency is selected (line 2). C_A adds a random integer r_q to the selected label and stores the result into a temporary variable E(λ_q) (line 3). C_A sends E(λ_q) to C_B and r_q to AU (line 4). C_B decrypts E(λ_q) and sends the result to AU (lines 5-6). AU obtains the final result by combining the values received from C_A and C_B (lines 7-8).
05: Receive E(λ_q) from C_A
06: λ'_q = D(E(λ_q)); Send λ'_q to AU
AU:
07: Receive r_q from C_A and λ'_q from C_B
08: c_q = λ'_q − r_q
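The final hand-off can be illustrated with plain modular arithmetic: C_A masks the majority label with r_q, C_B only ever sees the masked value, and AU removes the mask. The sketch below is a plaintext analogue under stated assumptions: labels are small integers, the modulus `N` is a hypothetical toy value, encryption and the secure frequency protocol are replaced by direct computation, and the function names are illustrative.

```python
import random
from collections import Counter

N = 1009 * 1013   # hypothetical plaintext modulus (toy size)

def cloud_a(labels, rng=random):
    """Lines 1-4: pick the most frequent label and mask it with a random r_q."""
    label = Counter(labels).most_common(1)[0][0]
    r_q = rng.randrange(N)
    lam_q = (label + r_q) % N     # the value C_B obtains after decryption
    return lam_q, r_q

def authorized_user(lam_q, r_q):
    """Lines 7-8: AU removes the mask to recover the class label c_q."""
    return (lam_q - r_q) % N

lam_q, r_q = cloud_a([2, 1, 2, 3, 2])
print(authorized_user(lam_q, r_q))
```

Because r_q is drawn uniformly from Z_N, the masked value seen by C_B is uniformly distributed and carries no information about the label on its own.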

4.5) Example of kNN classification
Here, an example of the proposed secure kNN classification algorithm is described. Assume that the original data is indexed and encrypted by using the kd-tree, as shown in Fig 4. The encrypted kd-tree stores four attributes for each leaf node, i.e., a node identifier (ID), an encrypted lower bound of the node, an encrypted upper bound of the node, and the encrypted data. Fig 5 shows how to extract the data in a selected node through the secure index search phase. First, C_A sends a node identifier (ID), an encrypted lower bound, an encrypted upper bound, and an encrypted query to all the threads. In each thread, the algorithm performs the GSRO protocol to determine whether a node includes the query or not. If a node includes the query, the GSRO protocol returns E(1). Otherwise it returns E(0). Second, the algorithm performs the RSM protocol by multiplying the encrypted data in each node (E(node_z.data)) and the results of the GSRO protocol. As a result, E(node_z.data) is returned only if the result of the GSRO protocol is E(1). Finally, the algorithm can safely obtain the encrypted data by merging the results of the RSM protocol. Fig 6 shows how to obtain kNN candidates through the k-nearest neighbors search phase. First, the algorithm selects the encrypted data which has the minimum distance from the query by using the GSMIN_n protocol. In Fig 6, E(d_3) is selected as the 1NN because the distance of d_3 is the minimum. Second, the algorithm sets the distance of the selected data to the maximum value to exclude the selected data. Therefore, the distance of E(d_3) is set to E(MAX). Finally, the algorithm is repeated until the k-th nearest data is selected. In the same way, E(d_4) and E(d_2) are selected as the 2NN and 3NN, respectively. As a result, the algorithm can safely select the k nearest neighbors. Figs 7 and 8 show examples of the index search and the k-nearest neighbor search in the kNN verification phase, respectively.
In each thread, the algorithm calculates the shortest distance E(spdist z ) between the query and a leaf node(node z ), and compares E(spdist z ) with E(dist k ). If E(spdist z ) is smaller than E(dist k ), the data in the node z is extracted. In Fig 7, because E(spdist 2 ), i.e., (E(1)), is smaller than E(dist k ), i.e., (E(5)), node 2 is selected. Fig 8 shows

5.1) Security proof of the secure protocols
In this section, we describe the security proof of the SME and GSCE protocols proposed in Section 3. To prove that the proposed protocols are secure under the semi-honest model, we show that the simulated images of the proposed protocols are computationally indistinguishable from their actual execution images.
Security proof of the SME protocol: We describe the security proof of the SME protocol by analyzing the security of the execution images of C_A and C_B. First, the execution image on the C_B side, i.e., Π_{C_B}(SME), is shown in Eq 8. Here, E(v'_1) and E(v'_2) are the encrypted data received from C_A (lines 1-2 of Algorithm 1), and v'_1 and v'_2 are obtained through the decryption of E(v'_1) and E(v'_2), respectively. Also, α is the result calculated by the SME protocol using v'_1 and v'_2. For example, assume that Π^S_{C_B}(SME) = {<E(s'_1), E(s'_2), s'_1, s'_2>, s_3} is the simulated execution image of the SME protocol on the C_B side. Here, E(s'_1) and E(s'_2) are non-deterministic numbers selected in Z_{N^2}, and s'_1 and s'_2 are indistinguishable numbers to which values from the random value pool have been added. s'_3 is the result of the SME protocol using s'_1 and s'_2 on the C_B side. Because the SME protocol is implemented based on the Paillier cryptosystem, it can 2), C_B cannot obtain the original data while performing the SME protocol. Meanwhile, the execution image of C_A is Π_{C_A}(SME) = {E(α)}, such that E(α) from C_B can be regarded as the result of the SME protocol. Suppose that the simulated image of C_A is Π^S_{C_A}(SME) = {E(s_4)}, where E(s_4) is randomly generated from Z_{N^2}. Therefore, E(α) is computationally indistinguishable from E(s_4). According to the above analyses, there is no information leakage at either the C_A side or the C_B side. Therefore, we can conclude that the proposed SME protocol is secure under the semi-honest adversarial model.
Security proof of the GSCE protocol: We describe the security proof of the GSCE protocol by analyzing the security of the execution images of the C_A side and the C_B side. First, the execution image on the C_B side, i.e., Π_{C_B}(GSCE), is shown in Eq 9. Here, E(s'_1) and E(s'_2) refer to the encrypted data received from C_A (lines 1-2 of Algorithm 2), and s'_1 and s'_2 are obtained through the decryption of E(s'_1) and E(s'_2), respectively. Also, β is the result calculated by the GSCE protocol using s'_1 and s'_2 on the C_B side. For example, assume that Π^S_{C_B}(GSCE) = {<E(s'_1), E(s'_2), s'_1, s'_2>, s_3} is the simulated execution image of the GSCE protocol on the C_B side. Here, E(s'_1) and E(s'_2) are non-deterministic numbers selected in Z_{N^2}, and both s'_1 and s'_2 are indistinguishable numbers selected in the random value pool. s'_3 is the result of the GSCE protocol using s'_1 and s'_2 on the C_B side. Because the GSCE protocol is implemented based on the Paillier cryptosystem, it can support semantic security. Therefore, E(s'_1) and E(s'_2) are computationally indistinguishable from s'_1 and s'_2, respectively. s'_3 is indistinguishable from s'_1 and s'_2 because s'_3 is calculated by comparing two indistinguishable numbers in C_A, s'_1 and s'_2. Therefore, it can be said that C_B can check only the result (e.g., β) of the comparison between the non-deterministic numbers (e.g., s'_1 and s'_2), so C_B cannot obtain the original data while performing the GSCE protocol. Meanwhile, the execution image of C_A is Π_{C_A}(GSCE) = {E(β)}, such that E(β) from C_B can be regarded as the result of the GSCE protocol. Suppose that the simulated image of C_A is Π^S_{C_A}(GSCE) = {E(s_4)}, where E(s_4) is randomly generated from Z_{N^2}. Therefore, E(β) is computationally indistinguishable from E(s_4). According to the above analyses, there is no information leakage at either the C_A side or the C_B side. Therefore, we can conclude that the proposed GSCE protocol is secure under the semi-honest adversarial model.

5.2) Security proof of the proposed kNN classification algorithm
We prove that the proposed kNN classification algorithm on the encrypted database is safe under the semi-honest attack model. The proposed kNN classification algorithm on the encrypted database consists of a secure index search phase (Algorithm 3), a kNN search phase (Algorithm 4), a kNN verification phase (Algorithm 5), and a kNN classification phase (Algorithm 6). To show that the proposed secure kNN classification algorithm is safe under the semi-honest attack model, we perform security analysis at each execution phase.

First, because the secure index search phase is composed of the GSRO protocol [17], which has been proven to be safe, Algorithm 3 is safe under the semi-honest attack model by the composition theory [17]. Second, the kNN search phase is safe on the $C_A$ side, because $C_A$ performs the ESSED, SMIN$_n$ and SM protocols, which have been proven to be safe in previous studies [16,17]. Even though $C_B$ decrypts the data received from $C_A$ during the kNN search phase, $C_B$ cannot extract the original data. This is because the data received from $C_A$ is modified by raising the original data to the power of a random integer and applying a shuffling function. Therefore, according to the composition theory, Algorithm 4 is safe under the semi-honest attack model. Third, the images generated by the kNN verification phase are the same as those generated by Algorithms 3 and 4. Therefore, the kNN verification phase (Algorithm 5) is safe under the semi-honest attack model. Lastly, the kNN classification phase (Algorithm 6) is safe under the semi-honest attack model because Algorithm 6 has been proven safe in previous work [16,17]. As a result, all the phases of the proposed secure kNN classification algorithm are safe under the semi-honest attack model.

Performance analysis
Because there is no existing privacy-preserving parallel kNN classification algorithm, we compare our privacy-preserving parallel kNN classification algorithm with extensions of existing works. That is, we make parallel SkNNC-M by extending B. K. Samanthula et al.'s work [16] in a naive way so that it can operate in a multi-core environment. We make parallel SkNNC-G by extending H. J. Kim et al.'s work [17] in the same way. For performance evaluation, the three algorithms were implemented in C++ on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz with 64GB (16GB × 4EA) DDR3 UDIMM 1600MHz memory in a Linux Ubuntu 18.04.2 environment. We compare the three parallel algorithms in terms of query processing time by varying the number of data, the value of k, the level of the kd-tree, the number of data dimensions, and the number of threads. We use both a synthetic dataset and a real dataset [28] for our experiments.

Table 4 shows the parameters used in the performance evaluation for the synthetic dataset. For the synthetic dataset, we randomly generate 30,000 integer data items with 12 dimensions. The domain of the data ranges from 0 to $2^{12}$. We ran an experiment to find the optimal level of the kd-tree (h). Both SkNNC-G and the proposed algorithm perform best when h is 7, so we set h to 7 in our experiments.

Fig 10 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to the number of data (n). When n = 30k, the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 215, 497, and 7,089 seconds, respectively. That is, the proposed algorithm shows 2.3 times better performance than parallel SkNNC-G and 32 times better performance than parallel SkNNC-M. This is because our secure protocols (SME and GSCE protocols) can reduce the number of data encryptions by selecting an encrypted value from the random value pool instead of generating it, as mentioned in Table 3.
Fig 11 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to k. When k = 20, the proposed parallel algorithm, parallel SkNNC-G, and parallel SkNNC-M require 202, 487, and 4,658 seconds, respectively. That is, the proposed algorithm shows 2.4 times better performance than parallel SkNNC-G and 23 times better performance than parallel SkNNC-M. The reason is the same as mentioned for Fig 10.

Fig 12 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to the number of data dimensions (m). When m = 6, the proposed parallel algorithm, parallel SkNNC-G, and parallel SkNNC-M require 57, 112, and 2,353 seconds, respectively. That is, the proposed algorithm shows 2 times better performance than parallel SkNNC-G and 15 times better performance than parallel SkNNC-M. The reason is the same as mentioned for Fig 10.

Fig 13 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to the number of threads. When the number of threads = 1 (single-core), the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 443, 894, and 15,572 seconds, respectively. That is, the proposed algorithm shows 2 times better performance than parallel SkNNC-G and 35 times better performance than parallel SkNNC-M. This is because our secure protocols (SME and GSCE protocols) can reduce the number of data encryptions by selecting an encrypted value from the random value pool instead of generating it. When the number of threads = 10, the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 93, 203, and 2,350 seconds, respectively. That is, the proposed algorithm shows 2.1 times better performance than parallel SkNNC-G and 25 times better performance than parallel SkNNC-M. Because the threads perform secure protocols concurrently without interfering with each other, query processing time linearly decreases as the number of threads increases.
As a result, our parallel algorithm shows better performance than the existing algorithms in a multi-core environment.
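The thread-scaling results can be made concrete by converting the reported timings into speedup and parallel efficiency. The sketch below uses only the 1-thread and 10-thread query processing times quoted above for the synthetic dataset; the helper names `speedup` and `efficiency` are introduced here for illustration.

```python
# Query processing time in seconds on the synthetic dataset,
# keyed by number of threads (values quoted in the text for Fig 13).
timings = {
    "proposed": {1: 443, 10: 93},
    "parallel SkNNC-G": {1: 894, 10: 203},
    "parallel SkNNC-M": {1: 15572, 10: 2350},
}

def speedup(t_single, t_parallel):
    """Speedup of the multi-threaded run over the single-threaded run."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, threads):
    """Parallel efficiency: the fraction of ideal speedup actually achieved."""
    return speedup(t_single, t_parallel) / threads

for name, t in timings.items():
    s = speedup(t[1], t[10])
    e = efficiency(t[1], t[10], 10)
    print(f"{name}: speedup {s:.2f}x with 10 threads, efficiency {e:.0%}")
```

Reading the speedup alongside the efficiency shows how much of the ideal ten-fold gain each algorithm realizes on this machine.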

PLOS ONE
Table 5 shows the parameters used in the performance evaluation for real data. For this, we used a chess dataset [28] generated from a chess endgame database for white king and rook against black king. The chess dataset aims to classify the optimal depth of win for white. With the real dataset, we ran an experiment to find the optimal level of the kd-tree (h). Both SkNNC-G and the proposed algorithm perform best when h is 7, so we set h to 7 in our experiments.

Fig 14 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to k. When k = 20, the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 425, 894, and 13,175 seconds, respectively. That is, the proposed algorithm shows 2 times better performance than parallel SkNNC-G and 27 times better performance than parallel SkNNC-M. This is because our algorithm uses both the SME and GSCE protocols, which can reduce the number of data encryptions by selecting an encrypted value from the random value pool.

Fig 15 shows the performance of the proposed algorithm, parallel SkNNC-M, and parallel SkNNC-G according to the number of threads. When the number of threads = 1 (single-core), the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 1,106, 2,306, and 44,570 seconds, respectively. That is, the proposed algorithm shows 2 times better performance than parallel SkNNC-G and 40 times better performance than parallel SkNNC-M. The reason is the same as mentioned for Fig 14. When the number of threads = 10, the proposed algorithm, parallel SkNNC-G, and parallel SkNNC-M require 227, 487, and 6,639 seconds, respectively. That is, the proposed algorithm shows 2 times better performance than parallel SkNNC-G and 29 times better performance than parallel SkNNC-M.
Because the threads perform secure protocols concurrently without interfering with each other, query processing time linearly decreases as the number of threads increases.

6.3) Theoretical analysis of the proposed algorithm in terms of privacy
Assuming that an attacker does not have any information about the original data items, the attacker needs a tremendous amount of time to obtain the original plaintext from the Paillier cryptosystem by using a brute force attack. This means that it is infeasible to demonstrate data protection, query protection and access pattern protection experimentally. Therefore, instead of an experimental analysis, we conduct a theoretical analysis of data privacy, query privacy and access pattern privacy to support the security analysis of the proposed algorithm. For this, we estimate the time it takes for the original data to be exposed and calculate the probability of access pattern leakage.

6.3.1 Theoretical analysis of data privacy. In $C_A$, an attacker obtains only the ciphertext of the data. Because the data is protected by the Paillier cryptosystem, the security level is measured through the time complexity of a brute force attack that breaks the Paillier cryptosystem. Our Paillier cryptosystem uses a 512-bit encryption key. Assuming a CPU clock of 4GHz, the time required to decrypt the ciphertext by trying every key is as shown in Eq (10). It is infeasible to break the Paillier cryptosystem in this way because it takes about $4.2 \times 10^{146}$ years with a 512-bit key. This means that the proposed privacy-preserving kNN classification algorithm is secure in terms of data privacy even if the ciphertext is exposed. Fig 16 shows the time taken for a brute force attack in $C_A$ as the key size changes. In $C_B$, an attacker obtains only plaintext data in which a random number has been added to the original data. In the Paillier cryptosystem, because the range of the plaintext data is $0 \le m \le 2^{512}$, the brute force attack time in $C_B$ is the same as that in $C_A$.

6.3.2 Theoretical analysis of query privacy. In $C_A$, an attacker obtains only the ciphertext of the query.
Because the query is protected by the Paillier cryptosystem, the security level is measured through the time complexity of a brute force attack that breaks the Paillier cryptosystem. Since our Paillier cryptosystem uses a 512-bit encryption key, the time required to decrypt the ciphertext by trying every key is as shown in Eq (10), where the CPU clock is 4GHz. It is infeasible to break the Paillier cryptosystem in this way because it takes about $4.2 \times 10^{146}$ years with a 512-bit key. This means that the proposed privacy-preserving kNN classification algorithm is secure in terms of query privacy even if the ciphertext is exposed. The time taken for a brute force attack in $C_A$ is the same as that for data privacy in $C_A$ (Fig 16). In $C_B$, query privacy is preserved because $C_B$ does not receive the query.
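The way brute-force time grows with key size can be sketched as follows. This is an illustrative model only: it assumes an exhaustive search over $2^{b}$ candidates for a $b$-bit key and a hypothetical rate of one trial per CPU cycle at 4GHz. The cost per trial is an assumption rather than a value taken from Eq (10), so the absolute figures depend on it and need not coincide with the reported estimate. The function name `log10_brute_force_years` is also introduced here for illustration.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600

def log10_brute_force_years(key_bits, trials_per_second=4e9):
    """Base-10 logarithm of the years needed to try all 2**key_bits keys.

    Working in logarithms avoids floating-point overflow for large keys.
    trials_per_second is an assumed rate (one trial per cycle at 4 GHz).
    """
    log10_trials = key_bits * math.log10(2)
    log10_trials_per_year = math.log10(trials_per_second * SECONDS_PER_YEAR)
    return log10_trials - log10_trials_per_year

for bits in (128, 256, 512, 1024, 2048):
    print(f"{bits}-bit key: about 10^{log10_brute_force_years(bits):.0f} years")
```

Whatever rate is assumed, each additional key bit doubles the search time, so changing the assumed rate shifts the whole curve by a constant factor without changing its exponential shape.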

Theoretical analysis of access pattern privacy.
The access pattern means the sequence in which data items are accessed. In the proposed algorithm, this sequence consists of the leaf node accesses of the kd-tree and the data accesses within a leaf node. In $C_A$, an attacker obtains only the ciphertext of a leaf node. Because all the leaf nodes have the same number of data items, an attacker cannot distinguish a leaf node by using the density of data items. If the kd-tree level is $h$, the number of leaf nodes is $2^{h-1}$. The probability that an attacker can distinguish a node ($node_i$) from the others, i.e., $P(node_i)$, is $\frac{1}{2^{h-1}}$. Because $node_i$ includes the same number of data items as the fanout, the probability that an attacker can distinguish a data item from the others in $node_i$, i.e., $P(node_i.data_j)$, is $\frac{1}{fanout}$. The probability of data access pattern leakage ($P_{APL}$) is shown in Eq (11).

$$P_{APL} = P(node_i) \times P(node_i.data_j) = \frac{1}{2^{h-1}} \times \frac{1}{fanout} \tag{11}$$
$P_{APL}$ is equal to the probability that an attacker distinguishes a specific data item from the others among all the data items. Therefore, the proposed algorithm can preserve the access pattern privacy in $C_A$. In $C_B$, access pattern privacy is preserved because $C_B$ does not have any data item.
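As a worked instance of the leakage probability, the sketch below multiplies the two probabilities described above: one for picking out a leaf node and one for picking out a data item inside it. The kd-tree level h = 7 is the value used in the experiments, while the fanout of 469 is a hypothetical value chosen for illustration (roughly 30,000 items spread over 64 leaves). The function name is likewise an assumption.

```python
from fractions import Fraction

def access_pattern_leak_probability(h, fanout):
    """Probability of singling out one data item under the model in the text.

    h      -- level of the kd-tree, giving 2**(h - 1) leaf nodes
    fanout -- number of data items stored in every leaf node
    """
    leaves = 2 ** (h - 1)
    p_node = Fraction(1, leaves)   # P(node_i)
    p_item = Fraction(1, fanout)   # P(node_i.data_j)
    return p_node * p_item

p = access_pattern_leak_probability(h=7, fanout=469)
print(p)         # 1/30016
print(float(p))
```

With equal-sized leaves, the product equals one over the total number of stored slots, which matches the statement that the attacker does no better than guessing among all the data items.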

Impact of hiding data access patterns
The data access pattern is one of the most important factors for privacy preservation. If an attacker learns the order or the frequency of data accesses, he/she can infer the original data from these access patterns. Therefore, hiding data access patterns is as important as encrypting data. First, B. Yao et al.'s work [21] proposed a secure kNN classification algorithm using the Voronoi diagram [22]. However, the order of accessing the Voronoi diagram is distinguishable, and an attacker can partially infer the original data from the query. Second, J. Du and F. Bian's work [25] proposed a kNN classification algorithm using an order-preserving index. However, the index access patterns are exposed because the order of accessing the index can be easily obtained from the query. This allows an attacker to easily infer the original data if he/she has an index access pattern. Meanwhile, our algorithm uses the Paillier cryptosystem, which supports semantic security for data protection. As a result, all of the ciphertexts are indistinguishable and secure against frequency-based attacks. In addition, the kd-tree filtering technique used in our algorithm does not expose data access patterns because our algorithm accesses only the encrypted leaf nodes of the kd-tree, rather than traversing the index with a top-down approach. Therefore, our algorithm can hide the data access patterns.

Impact of parallel algorithm with garbled circuit
First, a garbled circuit is used for efficient processing of secure protocols. B. K. Samanthula et al.'s work [16] has high overhead because it uses a secure protocol based on the comparison of binary arrays. To overcome this problem, our secure protocols use a garbled circuit that performs a fast and secure comparison operation over ciphertexts. Second, the existing privacy-preserving classification algorithms do not use parallelism [16,17,25]. In contrast, we propose a parallel classification algorithm that adopts the garbled circuit. Our algorithm performs three phases in parallel: index searching, kNN searching and kNN verification. As shown in our performance evaluation, our parallel classification algorithm shows performance improvement in proportion to the number of threads.

Impact of encrypted random value pool
In our secure system, we use two-party computation for the parallel kNN classification algorithm. Thus, we need to prevent $C_B$ from extracting meaningful information while executing secure protocols. For this, $C_A$ generates a random value $r$ from $\mathbb{Z}_N$ and encrypts $r$ by using the Paillier cryptosystem. Then, $C_A$ adds the encrypted random value $E(r)$ to the encrypted plaintext $E(m)$ by computing $E(m + r) = E(m) \times E(r)$. Because $m + r$ is independent of $m$, $C_B$ cannot obtain meaningful information through decryption. However, adding a random value to the ciphertext in the Paillier cryptosystem leads to performance degradation because both encryption and decryption operations require higher computation cost than other operations on encrypted data. Meanwhile, by using the random value pool, our algorithm requires only one encryption, for the result of the comparison at $C_B$. Therefore, our algorithm can reduce the computation cost of encryption by using the encrypted random value pool.
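The blinding step $E(m + r) = E(m) \times E(r)$ and the idea of drawing $E(r)$ from a precomputed pool can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a textbook Paillier construction with toy primes and generator $g = n + 1$, and the names `build_pool` and `blind` are hypothetical. It does not reproduce the SME or GSCE protocols themselves.

```python
import math
import secrets

def keygen(p=1009, q=1013):
    """Toy Paillier key pair (primes far too small for real use)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))  # mu = lam^-1 mod n since g = n + 1

def encrypt(n, m):
    """c = (n + 1)^m * r^n mod n^2 with fresh randomness r."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, sk, c):
    lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

def build_pool(n, size):
    """Precompute (r, E(r)) pairs once, so that blinding needs no encryption later."""
    pool = []
    for _ in range(size):
        r = secrets.randbelow(n)
        pool.append((r, encrypt(n, r)))
    return pool

def blind(n, c_m, pool):
    """Return (r, E(m + r)): one modular multiplication, no new encryption."""
    r, c_r = secrets.choice(pool)
    return r, (c_m * c_r) % (n * n)

n, sk = keygen()
pool = build_pool(n, size=16)            # done ahead of time by C_A
c_m = encrypt(n, 1234)                   # an encrypted value held by C_A
r, c_blinded = blind(n, c_m, pool)       # C_A blinds it with a pooled E(r)
masked = decrypt(n, sk, c_blinded)       # C_B decrypts and sees only m + r mod n
print((masked - r) % n)                  # C_A can remove r again
```

The design point is that the expensive modular exponentiations happen while the pool is being built, so each later blinding costs a single multiplication modulo $N^2$.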

Practical example of proposed kNN classification
The proposed secure kNN classification algorithm can be used in various fields. First, it can be used to diagnose a disease by classifying the patterns of a patient's symptoms [29]. Because an existing disease diagnosis system depends only on the doctor's knowledge and experience, it may harm patients through misdiagnosis. kNN classification algorithms can help doctors classify the pattern of a patient's symptoms so as to diagnose what kind of disease it is. However, because patients' information contains sensitive data, such as past medical history, family history and allergies, the proposed privacy-preserving kNN classification algorithm can be used to protect the sensitive data of patients. Second, the proposed privacy-preserving kNN classification algorithm can be used to solve the problem of insurance coverage recommendation, where insurance companies provide the most suitable coverage for customers [30]. Insurance coverage recommendation classifies customers' grades based on various kinds of customer information, such as movement patterns and lifestyles. When classifying customers' grades, the proposed privacy-preserving kNN classification algorithm can be used to protect the personal information of customers.

Conclusion
In this paper, we proposed a parallel kNN classification algorithm over encrypted data to preserve data privacy, query privacy, and access pattern privacy in cloud computing. To reduce the computation cost for encryption, we proposed two secure protocols, SME and GSCE, which support secure multi-party computation by using an encrypted random value pool. To reduce the query processing time, we not only designed a parallel algorithm, but also adopted a garbled circuit. In addition, we proved that our algorithm over the encrypted database is safe under the semi-honest attack model. Through our performance evaluation, our algorithm showed about 2∼25 times better performance compared with the existing algorithms. For future work, we plan to apply our parallel query processing algorithm to secure k-Means clustering.